Rainbow Tour Travel Agency - Analysis of Travel Packages¶

Rainbow Tours S.A. is a Polish travel agency that has been operating since 1990. The company offers lots of different types of trips, including all-inclusive vacations and guided tours, that vary in terms of destination. Most of the travels are organized within Europe or in some popular place outside Europe, for instance to Turkey. Offerts available on the agency website are marked by different popularity among the company customers. The interest in purchasing a specific travel package is reflected in a number of ratings given to the offer by users.

The main objective of the following project is to train several ML models being able to predict whether a specific trip is popular among the agency clients. Furthermore, this project aims to recognise important factors that cheer the trip up, make it encouraging and appealing to a wider range of customers and, as a resut, more popular in general.

To achieve the final goals, the first step involved gathering data. The final dataset was scrapped (extracted from the html code) from the Rainbow Tour website and consists of more than 2 000 travel offerts organized in 2023. Next, the dataset was appropriately prepared and number of reviews on the website was choosen as a criterion of each travel popularity. Because the popular trips are in minority of the agency offer, different methods of dealing with unbalanced data were applied.

In order to better understand and interpret the way that the proposed models make the classification, different methods of explanation and examination of predictive machine learning models were used, inluding Partial Dependence Profiles, Ceteris-Paribus Profiles, Break-Down plots and Shapley Additive Explanations.

1. Web-scraping¶

The first step of the project involves gathering the data from the Rainbow Tour website. It was done with a web-scraping technique using selenium and beautiful-soup packages. The code used for that purpose is presented below.

1.1 Importing libraries¶

In [4]:
# web-scraping packages
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException

from bs4 import BeautifulSoup

import pandas as pd
import numpy as np
import requests
import time
import re

import math
import warnings

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, roc_auc_score, roc_curve

import dalex as dx
In [5]:
def notebook_setup():
    # set pandas to display 3 decimal places
    pd.set_option('display.float_format', lambda x: '%.3f' % x)
    # disable warnings
    warnings.filterwarnings('ignore')
    # plot formatting
    sns.set_style("whitegrid")
    sns.set(font_scale=1.1)

notebook_setup()
seed = 123

1.2. Opening a browser¶

In [6]:
# open web browser
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.session_id
In [7]:
# open the Rainbow Tour wenstite
PAGE_URL = 'https://r.pl/szukaj?sezon=lato-2023'
driver.get(PAGE_URL)
driver.maximize_window()
In [8]:
# accept cookies
btn = driver.find_element(By.XPATH, "//*[contains(text(), 'Akceptuj wszystkie')]")
btn.click()

1.2. Expanding list of displayed offers¶

Since the number of displayed offers on the Rainbow website is limited to 10, the code below was used to automatically expand the list of offers using the 'Show More' button. For this purpose, the Selenium library was employed.

The following code allows for the opening of a browser window and expanding the list of offers by clicking the "Show More" button a defined number of times. The "Show More" button on the Rainbow website disappears from view after being clicked, remaining invisible until the website reloads three times to show additional. Website starts to reload offers automatically after scrolling to the bottom of the list. The code considers this behavior and, after clicking the "Show More" button multiple times, scrolls the page up and down to facilitate the loading of additional offers. It then waits for a specified number of seconds for the button to reappear. After fully expanding the list of offers, the HTML code of the page was extracted.

As a result of web scraping, the HTML code of the page containing 2098 offers was obtained, from which essential information about the trips was extracted. The extracted information is stored in a datframe and saved in .csv file.

In [9]:
# expand list of offerts MAX_I times
SHOW_MORE_CLASS_NAME = 'szukaj-bloczki__load-more-button'
MAX_I, MAX_J = 100, 5
SLEEP_TIME = 5

for i in range(MAX_I):
    try:
        # click on "Show more" button
        expand_bar = driver.find_element(By.CLASS_NAME, SHOW_MORE_CLASS_NAME)
        expand_bar.click()
        for j in range(MAX_J):
            # scroll page up and down to trigger the offerts loading
            actions = ActionChains(driver)
            actions.send_keys(Keys.PAGE_UP).perform()
            actions.send_keys(Keys.PAGE_DOWN).perform()
            # wait SLEEP_TIME seconds for website to load
            time.sleep(SLEEP_TIME)
        print(f"List expanded {i+1} times")
    except Exception as ex:
        print(f"Exception during expanding list: {ex}")
        break

1.3. Reading html¶

In [10]:
# retrieve html content
html_content = driver.page_source

# save html to text file
# with open('page_html_content.txt', 'w', encoding="utf-8") as file:
#     file.write(html_content)

# close the browser window
driver.close()
In [11]:
# parse html content with bs4
soup = BeautifulSoup(html_content, 'html.parser')

1.4. Scraping trip attributes¶

In [12]:
# find all blocks with offert
OFFER_BLOCK_CLASS_NAME = 'n-bloczek szukaj-bloczki__element'
offer_blocks = soup.find_all("a", {"class": OFFER_BLOCK_CLASS_NAME})
In [13]:
# class names of proper html tags
TITLE = re.compile('.*r-bloczek-naglowek__tytul.*')
LOCATION = re.compile('.*r-bloczek-lokalizacja.*')
RATING = re.compile('.*r-bloczek-ocena r-bloczek__ocena.*')
PRICE = re.compile('.*r-bloczek-cena__aktualna.*')
STARS = 'r-gwiazdki r-gwiazdki--medium'
DATE = 'r-bloczek-wlasciwosc__dni'
AIRPORT = 'r-bloczek-wlasciwosc'
FOOD = 'r-bloczek-wlasciwosc'
OLD_PRICE = 'r-bloczek-cena__przed-obnizka'
In [14]:
def get_price(offer_block_html):
    offer = offer_block_html
    offer_price = offer.find("span", {"class": PRICE}).get_text().strip()
    price = float(re.search('\d+\ \d+', offer_price).group().replace(' ', ''))
    
    # old price
    price_before = offer.find("span", {"class": OLD_PRICE})
    if price_before is not None:
        price_before = offer_price_before.get_text().strip()
        price_before = float(re.search('\d+\ \d+', price_before).group().replace(' ', ''))
    else:
        price_before = np.nan
        
    return price, price_before

def get_food(offer_block_html):
    offer = offer_block_html
    offer_food = offer.find_all("div", {"class": AIRPORT})
    if len(offer_food) > 2:
        offer_food = offer_food[2].find_all('span')[0].get_text().strip()
    else:
        offer_food = np.nan
    return offer_food

def get_airport(offer_block_html):
    offer = offer_block_html
    airport = offer.find_all("div", {"class": AIRPORT})[1].find_all('span')[0].get_text().strip()
    return airport
    
def get_time(offer_block_html):
    offer = offer_block_html
    # '05.05.2023 (8 days)' format
    offer_date = offer.find("div", {"class": DATE}).find_all('span')[0].get_text().strip() 
    duration = int(re.search('\d+ dni', offer_date).group().replace(' dni', ''))
    date = re.search('[\d\.]+ \(', offer_date).group().replace(' (', '')
    return date, duration

def get_rating(offer_block_html):
    offer = offer_block_html
    offer_rating = offer.find("span", {"class": RATING})
    if offer_rating is not None:
        offer_rating = offer_rating.get_text().strip()
        offer_rating = offer_rating.replace('\xa0', '')
        ratings_no = int(re.search('\(\d+\.*', offer_rating).group().replace('(', ''))
        rating = float(re.search('\d+\.\d+', offer_rating).group().replace(' ', ''))
    else:
        rating = np.nan
        ratings_no = 0
        
    return rating, ratings_no

def get_stars(offer_block_html):
    offer = offer_block_html
    offer_stars = offer.find("div", {"class": STARS})
    if offer_stars is not None:
        offer_stars = float(offer_stars['data-rating'])
    else:
        offer_stars = np.nan
        
    return offer_stars

def get_loc(offer_block_html):
    offer = offer_block_html
    loc = offer.find("span", {"class": LOCATION}).get_text().strip()
    loc = loc.replace('\u205f•\u205f', ' ')
    loc = loc.replace('\u205f•', ' ')
    
    # split location into county an d city
    loc_component = loc.split(":")
    if len(loc_component) > 1:
        country, city = loc_component[:2]
    else:
        country = loc_component[0]
        city = np.nan
        
    # remove words not related to location such as "vacation" or "tour" 
    country = country.replace("Wypoczynek ", "")
    country = country.replace("Objazd + wypoczynek ", "")
    country = country.replace("Objazd ", "")
    
    return country, city

def get_offer_df(offer_block_html):
    """
    Function retrieves information about single offer from html and saves them to pandas data frame.
    """
    offer = offer_block_html
    # offer title
    title = offer.find("span", {"class": TITLE}).get_text().strip()
    # trip destination
    country, city = get_loc(offer)
    # stars
    offer_stars = get_stars(offer)
    # rating      
    rating, ratings_no = get_rating(offer)
    # date
    date, duration = get_time(offer)
    # airport
    airport = get_airport(offer)
    # food
    food = get_food(offer)
    # price
    price, price_before = get_price(offer)
    # create dict with info about a trip
    trips_dict = {
        "title": [title],
        "location": [loc],
        "country": [country],
        "city": [city],
        "stars": [offer_stars],
        "rating": [rating],
        "ratings_no": [ratings_no],
        "date": [date],
        "duration": [duration],
        "airport": [offer_airport],
        "food": [offer_food],
        "price": [price],
        "old_price": [offer_price_before]
    }
    return pd.DataFrame(trips_dict)
In [15]:
# create data frame where each row corresponds with info about a single offer
trips_df = [get_offer_df(offer) for offer in offer_blocks]
trips_df = pd.concat(trips_df, ignore_index=True)
In [17]:
# save data to .csv
trips_df.to_csv("rainbow_summer_2023.csv", index=False)

2. Exploratory data analysis¶

In the second part of the project, a descriptive analysis was conducted on the data related to summer 2023 tours organized by Rainbow Tour travel agency. As part of the analysis, descriptive statistics of features were interpreted and visualized on graphs.

2.1 Dataset description¶

The analyzed dataset consists of 2098 records and 13 columns. Each row describes a single tour available for purchase on the Rainbow Tour travel agency website. All trips are scheduled between May and December 2023.

The columns present in the described dataset contain the following information about the tours:

  • Title – the title of the offer,
  • Location - the destination of the trip (both country and city),
  • Country – the destination country of the tour,
  • City – the destination city of the tour,
  • Stars - the number of stars assigned to the trip by the agency,
  • Ratings - the average user rating on the website,
  • Ratings number - the number of times the tour has been rated,
  • Date - the start date of the tour,
  • Duration - the duration of the journey in days,
  • Airport - the city where the airport is located,
  • Food - the type of food provided during the trip,
  • Price - the current offer price,
  • Old price - the previous offer price (if the trip is discounted).

Some variables, such as "City," "Stars," "Rating," and "Old price," have missing values in the data, which can be expected from the dataset obtained using web-scraping. Tours with missing values in the "City" column usually have multiple destinations. The "Stars" and "Rating" columns may be empty because some offers have never been rated by the agency or any user on the website. The absence of values in the "Old price" column indicates that the tour still has its base price and has never been discounted.

In [2]:
raw_data = pd.read_csv("data.csv")
raw_data[['country', 'city', 'airport', 'food']] = raw_data[['country', 'city', 'airport', 'food']].apply(lambda x: x.str.strip())
raw_data.head()
Out[2]:
title location country city stars rating ratings_no date duration airport food price old_price
0 Paras Sun Wypoczynek Grecja: Rodos Grecja Rodos 2.000 4.500 188 12.05.2023 8 Warszawa Chopin Śniadania 1500.000 NaN
1 Anaxos Bay Wypoczynek Grecja: Lesbos Grecja Lesbos 2.000 5.000 11 30.08.2023 8 Katowice Bez wyżywienia 1558.000 NaN
2 Elena Studios Wypoczynek Grecja: Lesbos Grecja Lesbos 2.000 4.900 25 30.08.2023 8 Katowice Bez wyżywienia 1635.000 NaN
3 Studia Maria Wypoczynek Grecja: Kokkino Nero Grecja Kokkino Nero 2.500 5.200 71 31.05.2023 8 Wrocław Bez wyżywienia 1766.000 NaN
4 Apartamenty Omiś Wypoczynek Chorwacja: Kastelanska Riwiera - Ma... Chorwacja Kastelanska Riwiera - Makarska Riwiera 2.500 3.700 12 22.06.2023 8 Katowice Bez wyżywienia 1779.000 NaN
In [3]:
raw_data.shape
Out[3]:
(2098, 13)

The printout below summarises number of missing values and datatypes of each variable.

In [4]:
raw_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2098 entries, 0 to 2097
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   title       2098 non-null   object 
 1   location    2098 non-null   object 
 2   country     2097 non-null   object 
 3   city        1869 non-null   object 
 4   stars       1621 non-null   float64
 5   rating      1511 non-null   float64
 6   ratings_no  2098 non-null   int64  
 7   date        2098 non-null   object 
 8   duration    2098 non-null   int64  
 9   airport     2098 non-null   object 
 10  food        2083 non-null   object 
 11  price       2098 non-null   float64
 12  old_price   787 non-null    float64
dtypes: float64(4), int64(2), object(7)
memory usage: 213.2+ KB

One of the tours present in the dataset has missing value in "Country" and "City" fields. Their values were concluded after visiting the website with the trip offer and imputed appropriately.

In [5]:
raw_data.loc[693]
Out[5]:
title         Peridis Family Resort
location                Wypoczynek 
country                         NaN
city                            NaN
stars                         5.000
rating                          NaN
ratings_no                        0
date                     26.09.2023
duration                          8
airport                     Wrocław
food                 Bez wyżywienia
price                      3316.000
old_price                       NaN
Name: 693, dtype: object
In [6]:
# update trip with missing data about country and city
# data were imputed based on the resort name
raw_data.loc[693, ["country", "city"]] = ["Grecja", "Kos"]

The table below presents descriptive statistics of the numeric variables from the dataset. Values indicate that the average offer rating by the agency is approximately 4 stars in scale from 0 to 5. There is no trip with number of stars lower than 2. Custmomers rating vary from 0 to 6 with the mean equal to 5,2 and standard deviation equal to 0,5. The average trip has been rated 42 times. Furthermore, the majority of tours is planned for 8 days. Trip prices follow a right-side skewed distribution with a mean price equal to 5,400 PLN. There are some offers that prices exceed 10,000 PLN, though. 787 out of 2098 tour packages is discounted on the website. The mean price of all tours calculated using the base price (before discount) equals approximately 7,500 PLN.

In [7]:
raw_data.describe().T
Out[7]:
count mean std min 25% 50% 75% max
stars 1621.000 4.194 0.752 2.000 4.000 4.000 5.000 5.000
rating 1511.000 5.204 0.513 2.800 4.900 5.300 5.500 6.000
ratings_no 2098.000 42.864 94.447 0.000 0.000 7.000 41.000 1153.000
duration 2098.000 8.137 1.984 1.000 8.000 8.000 8.000 22.000
price 2098.000 5403.184 3341.820 1500.000 3030.000 4174.000 7178.000 31845.000
old_price 787.000 7511.075 3942.276 2447.000 4298.000 7383.000 9109.000 32345.000

The printout below presents information about number of unique values in each column.

In [8]:
raw_data.nunique()
Out[8]:
title         2018
location       440
country        155
city           263
stars            7
rating          31
ratings_no     253
date           199
duration        20
airport         12
food            16
price         1702
old_price      734
dtype: int64

2.2. Data visualization¶

Target Variable - Tour Popularity¶

As mentioned earlier, the aim of this project is to build a classifier capable of identifying tours that are most popular among all Rainbow Tour offerings. The popularity of a trip is determined based on the number of reviews it has received from the agency's customers. If a tour has been reviewed more than 30 times, it is labeled as popular.

Applying this criterion reveals that 621 offers have been labeled as popular. Since the number of popular tours (the positive class share) is significantly smaller, the dataset is considered imbalanced. This imbalance may pose challenges during model training. Therefore, in the subsequent steps of this project, certain techniques have been applied to mitigate the impact of the imbalanced class problem.

In [9]:
raw_data["is_popular"] = np.where(raw_data["ratings_no"] > 30, 1, 0)
target = "is_popular"

# make column names consistent
old_cols = raw_data.columns
# create another dataframe for plotting the data
plot_data = raw_data.copy()

# prettify column names to make them looks better on plots
def prettify_text(text):
    text = text.replace(".", "")
    text = text.replace("_", " ")
    text = " ".join(re.sub(r"([A-Z])", r" \1", text).split())
    text = text[0].upper() + text[1:].lower()
    return text

binary_map = {0: "No", 1: "Yes"}
plot_data[target] = plot_data[target].map(binary_map)
plot_data.columns = [prettify_text(col) for col in old_cols]

plt.figure(figsize=(7, 6))
ax = sns.countplot(data=plot_data, 
                   x="Is popular", 
                   order=plot_data["Is popular"].value_counts().index)
ax.set(ylabel="Count")

plt.show()

Destination of Tours¶

The analyzed offers have various destinations primarily located in European countries. The most popular destinations for vacations organized by Rainbow are Greece (over 250 trips), Spain (approximately 200 trips), and Turkey (also around 200 trips). Among other frequently visited countries are Egypt, Italy, Bulgaria, and Tunisia. Greece and Egypt exhibit the highest percentage of popular tours among all available offers.

Concerning the most frequently visited cities, the dominant ones include the Turkish Riviera, Varadero, and Hurghada. Sicily, Sunny Beach, Tenerife, and the Egyptian Riviera are also popular. The highest percentage of popular tours among all available offers to these cities characterizes Varadero and Hurghada, as well as the Egyptian Riviera.

In [10]:
def words_counter(data, col):
    words_cnt = dict(Counter(data[col]))
    words_cnt_list = [(key, val) for key, val in words_cnt.items()]
    words_cntr_df = pd.DataFrame(words_cnt_list, 
                                columns=[col.title(), "Count"]).sort_values(["Count"], ascending=False)
    return words_cntr_df

country_cnt = words_counter(raw_data, 'country')
city_cnt = words_counter(raw_data, 'city')
city_cnt = city_cnt.dropna()

popular_countires = country_cnt.where(country_cnt['Count'] > 85)['Country'].dropna().unique()
popular_cities = city_cnt.where(city_cnt['Count'] > 39)['City'].dropna().unique()


fig, axs = plt.subplots(2, 2, figsize=(20, 10))

ax = sns.barplot(x='Country', y='Count', data=country_cnt.head(7), ax=axs[0, 0])
axs[0, 0].set(ylabel="Count")
    
ax = sns.barplot(x='City', y='Count', data=city_cnt.head(7), ax=axs[0, 1])
axs[0, 1].set(ylabel="Count")
    
sns.countplot(data=plot_data[plot_data['Country'].isin(popular_countires)], 
              x='Country', hue='Is popular', ax=axs[1, 0], dodge=False, hue_order = ['No', 'Yes'], 
              order=plot_data[plot_data['Country'].isin(popular_countires)]["Country"].value_counts().index)
axs[1, 0].set(xlabel="Country", ylabel="Count")

sns.countplot(data=plot_data[plot_data['City'].isin(popular_cities)], 
              x='City', hue='Is popular', dodge=False, ax=axs[1, 1], hue_order = ['No', 'Yes'], 
              order=plot_data[plot_data['City'].isin(popular_cities)]["City"].value_counts().index)
axs[1, 1].set(xlabel="City", ylabel="Count")
    
plt.show()

Guided tours¶

It has been observed that some tours fall into the category of guided tours, meaning that during vacation customers of the agency visit more than one country. It was decided to label these tours by creating a new column called "Multiple destination." It can be noted that a very small number of offers is organized to more than one country.

In [11]:
for country in raw_data["country"].unique()[-5:]:
    print("-", country)
- Peru, Brazylia, Argentyna, Chile, Boliwia
- Australia
- Australia i Nowa Zelandia
- Singapur, Australia, Nowa Zelandia, Fidżi
- Australia, Zjednoczone Emiraty Arabskie, Nowa Zelandia
In [12]:
# create column indicating if trip has multiple destinations
raw_data['multiple_destination'] = np.where(raw_data["country"].str.contains(", "), 1,
                                               np.where(raw_data["country"].str.contains(" i "), 1, 0))
plot_data['Multiple destination'] = raw_data['multiple_destination'].copy()
plot_data['Multiple destination'] = plot_data['Multiple destination'].map(binary_map)

plt.figure(figsize=(10, 5))
ax = sns.countplot(x='Multiple destination', hue='Is popular', dodge=1, 
                   data=plot_data, hue_order = ['No', 'Yes'])
ax.set(ylabel="Count")
plt.show()

Stars and user ratings¶

From the distribution of number of stars, it is evident that stars were awarded by either the agency expert or a simple algorithm, as integer values predominate (primarily 4 or 5 stars). The tours rated with 4 or 4.5 stars have the greatest popularity.

On the other hand, the density plot of user ratings based on tour popularity indicates that higher-rated tours, around average ratings of 5 - 5.5, tend to be more popular, but not exceptionally so. Tours below a rating of 5 and above 5.5 are often less popular among users. This could be attributed to the fact that frequently booked tours also receive a substantial number of user ratings, and users may assess tours subjectively — rating them higher or lower at different times. As a result, the ratings tend to average out on a some level. If a tour is less popular, it often has a small number of ratings, which can result in a very high or low average rating if a few users rate the tour similarly. Such ratings are less representative.

In [15]:
fig, axs = plt.subplots(1, 2, figsize=(20, 5), gridspec_kw=dict(width_ratios=[1, 2]))

sns.countplot(data=plot_data, x='Stars', hue='Is popular', 
              dodge=0, ax=axs[0], hue_order = ['No', 'Yes'])
axs[0].set(xlabel="Stars", ylabel="Count")
axs[0].legend(title='Is popular', loc='upper left', labels=['No', 'Yes'])

sns.histplot(plot_data.loc[plot_data['Is popular']=="No", "Rating"], ax=axs[1], kde=True)
sns.histplot(plot_data.loc[plot_data['Is popular']=="Yes", "Rating"], ax=axs[1], kde=True, color="darkorange")
axs[1].set(xlabel="Rating")

plt.show()

Tour Duration and Date¶

Analyzing the chart of the months in which tours are organized reveals the surprising fact that the majority of them are scheduled in December. This may be attributed, for example, to the fact that people go on vacation during the Christmas holidays. Many offers are available for May and June, as well as September and October. The most popular departures occur in the spring months of May and June. The most frequent duration of tours is 8-9 days, while a minimal portion lasts for 4-5 days. Other tour lengths occur very rarely.

In [16]:
raw_data['month'] = pd.to_datetime(raw_data['date'], format="%d.%m.%Y").dt.strftime('%B')
plot_data['Month'] = raw_data['month'].copy()
mnth_order = ['May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

fig, axs = plt.subplots(1, 2, figsize=(20, 5), gridspec_kw=dict(width_ratios=[2, 3]))

sns.countplot(x='Month', hue='Is popular', dodge=0, data=plot_data, 
              order=mnth_order, ax=axs[0], hue_order = ['No', 'Yes'])
axs[0].set(ylabel="Count", xlabel='Month')
axs[0].legend(title='Is popular', loc='upper left', labels=['No', 'Yes'])
axs[0].set_xticklabels(mnth_order, rotation=25)

sns.countplot(x='Duration', hue='Is popular', dodge=0, data=plot_data, 
              ax=axs[1], hue_order = ['No', 'Yes'])
axs[1].set(ylabel="Count", xlabel='Duration')
axs[1].legend(title='Is popular', loc='upper left', labels=['No', 'Yes'])

plt.show()

Airport and food¶

The majority of Rainbow Tours depart from Warsaw Chopin Airport or Katowice. However, the departure location does not seem to impact the popularity of the tour.

Regarding meal options, the most commonly available choices are breakfast, breakfast and dinner, or "ultra all-inclusive." Among these three, tours offering breakfast and dinner are the most popular.

In [17]:
raw_data['airport'] = raw_data['airport'].replace('/', '', regex=True)
plot_data['airport'] = raw_data['airport'].copy()

air_cnt = words_counter(raw_data, 'airport')
popular_airport = air_cnt.where(air_cnt['Count'] > 20)['Airport']

food_cnt = words_counter(raw_data, 'food')
popular_food = food_cnt.where(air_cnt['Count'] > 20)['Food']

fig, axs = plt.subplots(1, 2, figsize=(22, 5))
ax = sns.countplot(x='Airport', hue='Is popular', dodge=1, ax=axs[0], 
                   data=plot_data[plot_data['Airport'].isin(popular_airport)], hue_order = ['No', 'Yes'])
ax.set(ylabel="Count", xlabel='Airport')
ax = sns.countplot(x='Food', hue='Is popular', dodge=1, ax=axs[1],
                   data=plot_data[plot_data['Food'].isin(popular_food)], hue_order = ['No', 'Yes'])
ax.set(ylabel="Count", xlabel='Food type')
plt.xticks(rotation=25)

plt.show()

Price¶

The following charts indicate that tours with lower prices are more popular, both in terms of current prices and considering prices before promotions. This is not surprising, as more affordable trips are financially accessible to a larger number of people.

In [18]:
fig, axs = plt.subplots(2, 2, figsize=(20, 10), gridspec_kw=dict(width_ratios=[1, 2]))

sns.barplot(data=plot_data, y='Price', x='Is popular', ax=axs[0,0])
axs[0,0].set(ylabel="Mean price")

sns.kdeplot(plot_data[plot_data['Is popular']=="No"]['Price'], ax=axs[0,1], legend=False)
sns.kdeplot(plot_data[plot_data['Is popular']=="Yes"]['Price'], ax=axs[0,1])
axs[0,1].set(xlabel="Price")
axs[0,1].legend(title='Is popualar', loc='upper left', labels=['No', 'Yes'])

sns.barplot(data=plot_data, y='Old price', x='Is popular', ax=axs[1,0])
axs[1,0].set(ylabel="Mean old price")

sns.kdeplot(plot_data[plot_data['Is popular']=="No"]['Old price'], ax=axs[1,1], legend=False)
sns.kdeplot(plot_data[plot_data['Is popular']=="Yes"]['Old price'], ax=axs[1,1])
axs[1,1].set(xlabel="Old price")
axs[1,1].legend(title='Is popualar', loc='upper left', labels=['No', 'Yes'])

plt.show()

Price Reduction¶

The following charts suggest that non-discounted offers or those with a minimal price reduction receive more reviews on the website compared to those with a substantial price reduction. This may be due to the agency offering discounts on less popular tours to attract a larger number of customers. Another explanation could be that the extent of the discount is positively correlated with the tour's original price. Therefore, the larger the discount, the more expensive the tour, and as noted earlier, expensive tours are less popular than their more affordable counterparts.

In [21]:
raw_data["discount"] = np.where(raw_data["old_price"].isna(), 0, 1)
raw_data["discount_value"] = np.where(raw_data["old_price"].isna(), 0, 
                                         raw_data["old_price"] - raw_data["price"])
plot_data['Discount'] = raw_data['discount'].copy().map(binary_map)
plot_data['Discount value'] = raw_data['discount_value'].copy()

fig, axs = plt.subplots(1, 2, figsize=(20, 7), gridspec_kw=dict(width_ratios=[1, 2]))
ax = sns.countplot(x='Discount', hue='Is popular', dodge=1, 
                   data=plot_data, hue_order = ['No', 'Yes'], ax=axs[0])
ax.set(ylabel="Count")
sns.histplot(plot_data[plot_data['Discount value'] > 0],
             x='Discount value', hue="Is popular", ax=axs[1], legend=True, kde=True)
plt.show()

3. Data preprocessing¶

3.1. Binary variables¶

After an initial analysis of the dataset, it was decided to remove some columns as no impact on the response variable was observed. Among the deleted columns were, for example, the airport and the city to which the tour is organized (the column had too many unique values).

In [22]:
features = 'country, stars, rating, duration, food, price, month, multiple_destination, discount, is_popular'.split(", ")
final_data = raw_data[features]
final_data.head()
Out[22]:
country stars rating duration food price month multiple_destination discount is_popular
0 Grecja 2.000 4.500 8 Śniadania 1500.000 May 0 0 1
1 Grecja 2.000 5.000 8 Bez wyżywienia 1558.000 August 0 0 0
2 Grecja 2.000 4.900 8 Bez wyżywienia 1635.000 August 0 0 0
3 Grecja 2.500 5.200 8 Bez wyżywienia 1766.000 May 0 0 1
4 Chorwacja 2.500 3.700 8 Bez wyżywienia 1779.000 June 0 0 0

Since the majority of columns in the analyzed dataset consist of categorical variables, often with multiple possible values, it was decided to transform these variables into binary variables. Two columns were created for the countries Greece and Egypt because they frequently appear in the offers and exhibit a respectively high and low percentage of popular tours. Offers with breakfast and breakfast and dinner were also distinguished as those categories are characterized with the highest and lowest popularity, respectively. Additionally, a separate column to indicate whether the tour is organized during the holiday season (from May to August) or later was created.

In [23]:
final_data['greece'] = np.where(~final_data['country'].isin(['Grecja']), 0, 1)
final_data['egypt'] = np.where(~final_data['country'].isin(['Egipt']), 0, 1)
final_data['breakfast'] = np.where(final_data['food'].isin(['Śniadania']), 1, 0)
final_data['breakfast_dinner'] = np.where(final_data['food'].isin(['Śniadania i obiadokolacje']), 1, 0)
final_data['summer'] = np.where(final_data['month'].isin(['May', 'June', 'July', 'August']), 1, 0)
final_data.drop(columns=['country', 'food', 'month'], inplace=True)
In [24]:
final_data.head()
Out[24]:
stars rating duration price multiple_destination discount is_popular greece egypt breakfast breakfast_dinner summer
0 2.000 4.500 8 1500.000 0 0 1 1 0 1 0 1
1 2.000 5.000 8 1558.000 0 0 0 1 0 0 0 1
2 2.000 4.900 8 1635.000 0 0 0 1 0 0 0 1
3 2.500 5.200 8 1766.000 0 0 1 1 0 0 0 1
4 2.500 3.700 8 1779.000 0 0 0 0 0 0 0 1

3.2. Missing values imputation¶

Since a significant portion (477) of tours have no stars assigned by the agency, it was decided to approximate it using the average user ratings. Subsequently, the column with the average number of ratings was removed. It was observed that the average user rating corresponding to star ratings from 2 to 5 mostly corresponds wit a user rating from 4.9 to 5.4. Growth by 1 star corresponds with increase in a average customer rating approximately by 0.1. This pattern does not occur in every interval (sometimes the average rating decreases instead of increasing), but it was decided to fill in the missing observations in the "Stars" column in this way. Tours with an average rating below 4.9 were assigned 2 stars. Offers rated between 4.9 and 5 received 2.5 stars, and so on, up to offers rated above 5.4, which were assigned 5 stars. Since the method of filling in missing stars uses somewhat arbitrarily chosen threshold values and imputed values that are independent of the input set, the imputation of missing data was performed before dividing the dataset into training and testing sets.

In [25]:
final_data.isna().sum()
Out[25]:
stars                   477
rating                  587
duration                  0
price                     0
multiple_destination      0
discount                  0
is_popular                0
greece                    0
egypt                     0
breakfast                 0
breakfast_dinner          0
summer                    0
dtype: int64
In [26]:
starts_mean_rating = final_data.groupby(['stars']).mean()['rating'].reset_index()
starts_mean_rating["count"] = final_data.groupby(['stars']).count()['rating'].reset_index()['rating']
starts_mean_rating
Out[26]:
stars rating count
0 2.000 4.967 21
1 2.500 4.689 9
2 3.000 5.001 162
3 3.500 4.954 37
4 4.000 5.173 458
5 4.500 5.230 81
6 5.000 5.405 347
In [27]:
def input_missing_stars(row):
    stars = row['stars']
    if not math.isnan(stars):
        return stars
    rating = row["rating"]
    if rating is np.nan:
        return 3.5 # return mean number of stars
    for threshold, stars in list(zip(np.arange(49, 55)/10, np.arange(20, 50, 5)/10)):
        if rating < threshold:
            return stars
    return 5.0
In [28]:
final_data['stars'] = final_data.apply(input_missing_stars, axis=1)

After imputing missing values in the "Stars" columns, there are no more missing values in the dataset.

In [29]:
final_data.drop(columns=['rating'], inplace=True)
final_data.isna().sum()
Out[29]:
stars                   0
duration                0
price                   0
multiple_destination    0
discount                0
is_popular              0
greece                  0
egypt                   0
breakfast               0
breakfast_dinner        0
summer                  0
dtype: int64

4. Train-test split¶

In the next step, the dataset was divided into a training set and a test set in a 9:1 ratio. The "stratify" argument was used to ensure the same proportion of popular and unpopular tours in both sets. Below, descriptive statistics for both sets were compared to ensure that the observations in the sets are similar.

In [30]:
X_cols = list(final_data.columns)
X_cols.remove('is_popular')
X, y = final_data.loc[:, X_cols], final_data.loc[:, 'is_popular']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=seed, stratify=y)
In [31]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape
Out[31]:
((1888, 10), (210, 10), (1888,), (210,))
In [32]:
X_train.describe().T
Out[32]:
count mean std min 25% 50% 75% max
stars 1888.000 4.166 0.851 2.000 4.000 4.000 5.000 5.000
duration 1888.000 8.118 1.932 1.000 8.000 8.000 8.000 21.000
price 1888.000 5369.082 3235.064 1500.000 3035.000 4182.000 7142.750 31598.000
multiple_destination 1888.000 0.056 0.230 0.000 0.000 0.000 0.000 1.000
discount 1888.000 0.379 0.485 0.000 0.000 0.000 1.000 1.000
greece 1888.000 0.135 0.341 0.000 0.000 0.000 0.000 1.000
egypt 1888.000 0.060 0.238 0.000 0.000 0.000 0.000 1.000
breakfast 1888.000 0.249 0.433 0.000 0.000 0.000 0.000 1.000
breakfast_dinner 1888.000 0.197 0.397 0.000 0.000 0.000 0.000 1.000
summer 1888.000 0.367 0.482 0.000 0.000 0.000 1.000 1.000
In [33]:
X_test.describe().T
Out[33]:
count mean std min 25% 50% 75% max
stars 210.000 4.140 0.871 2.000 4.000 4.000 5.000 5.000
duration 210.000 8.310 2.401 3.000 8.000 8.000 9.000 22.000
price 210.000 5709.781 4177.980 1779.000 2947.250 4049.000 7434.000 31845.000
multiple_destination 210.000 0.067 0.250 0.000 0.000 0.000 0.000 1.000
discount 210.000 0.343 0.476 0.000 0.000 0.000 1.000 1.000
greece 210.000 0.114 0.319 0.000 0.000 0.000 0.000 1.000
egypt 210.000 0.057 0.233 0.000 0.000 0.000 0.000 1.000
breakfast 210.000 0.229 0.421 0.000 0.000 0.000 0.000 1.000
breakfast_dinner 210.000 0.200 0.401 0.000 0.000 0.000 0.000 1.000
summer 210.000 0.386 0.488 0.000 0.000 0.000 1.000 1.000
In [34]:
y_test.mean(), y_train.mean()
Out[34]:
(0.29523809523809524, 0.2960805084745763)

5. Data standarization¶

In this part, both the training and test sets were standardized using the mean and standard deviation calculated from the training set. Standardizing the data can be beneficial as it improves the numerical stability of the model and often shortens the construction time. Some of the methods used in this project require input data to be standardized.

In [35]:
def scale_train_test(X_train, X_test):
    """
    Calculates mean (m) and standard deviation (sd) 
    of a training set and standardizes both train and test set 
    According to the formula z = (x - m) / sd.
    """
    m, sd = X_train.mean(), X_train.std()
    X_train_scaled = (X_train - m) / sd
    X_test_scaled = (X_test - m) / sd
    return X_train_scaled, X_test_scaled
In [36]:
X_train_scaled, X_test_scaled = scale_train_test(X_train, X_test)

6. Model training¶

This section covers the construction of three models: Random Forest, Gradient Boosting Trees, and Support Vector Machines. All models were evaluated in terms of their classification abilities and compared. An attempt was also made to interpret them using relevant charts.

6.1. Random Forest¶

The first model constructed to identify popular Rainbow Tour trips is the Random Forest. The model utilizes multiple decision trees and makes decisions based on majority voting, i.e., the classification indicated by the majority of trees. During the construction of each decision tree node, only a small subset of all features is considered. This prevents the model from using the same, most influential predictors when building each tree. As a result, many uncorrelated decision trees are created, and their predictions are averaged to obtain the final classification decision.

Before training the model, hyperparameter tuning was conducted to select a combination of model parameters that resulted in the lowest classification error rate during cross-validation. This was done using random search, involving the exploration of randomly chosen values from a defined parameter space. The method identified the following optimal hyperparameters:

  • Number of trees in the forest: 300
  • Method for calculating the maximum number of variables to consider at each tree split: log2 of the total number of variables
  • Maximum tree depth: 4 levels
  • Minimum number of samples present in a leaf: 3
  • Minimum number of observations needed to split a node: 5

6.1.1. Hyperparameters tunning¶

In [37]:
# liczba drzew
n_estimators = [int(x) for x in np.linspace(50, 300, 6, endpoint=True)]
# metoda obliczania max. liczby zmiennych do rozważenia przy każdym podziale drzewa
max_features = ['log2', 'sqrt']
# głębokość drzewa
max_depth = [int(x) for x in np.linspace(1, 4, 4, endpoint=True)]
# minimalna liczba próbek obecnych w liściu
min_samples_leaf = [int(x) for x in np.linspace(1, 11, 5, endpoint=True)]
# minimalna liczba obserwacji potrzebna do podziału węzła
min_samples_split = [int(x) for x in np.linspace(5, 55, 5, endpoint=True)]

random_grid = {'n_estimators': n_estimators,
                'max_features': max_features,
                'max_depth': max_depth,
                'min_samples_leaf': min_samples_leaf,
                'min_samples_split': min_samples_split}
In [38]:
# forest = RandomForestClassifier(random_state=seed)

# rf_cv = RandomizedSearchCV(estimator = forest, 
#                            param_distributions = random_grid, 
#                            n_iter = 150, cv=5, verbose=2, 
#                            random_state=seed, n_jobs = -1)
# rf_cv.fit(X_train, y_train)
# rf_cv.best_params_
Fitting 5 folds for each of 150 candidates, totalling 750 fits
Out[38]:
{'n_estimators': 300,
 'min_samples_split': 5,
 'min_samples_leaf': 3,
 'max_features': 'log2',
 'max_depth': 4}
In [39]:
# with open('rf_cv.pickle', 'wb') as handle:
#     pickle.dump(rf_cv, handle, protocol=pickle.HIGHEST_PROTOCOL)
In [38]:
with open('rf_cv.pickle', 'rb') as handle:
    rf_cv = pickle.load(handle)

6.1.2. Training the RF model¶

In [39]:
rf = RandomForestClassifier(n_estimators=rf_cv.best_params_['n_estimators'], 
                            min_samples_split=rf_cv.best_params_['min_samples_split'], 
                            min_samples_leaf=rf_cv.best_params_['min_samples_leaf'], 
                            max_features=rf_cv.best_params_['max_features'], 
                            max_depth=rf_cv.best_params_['max_depth'], random_state=seed)
rf.fit(X_train, y_train)
Out[39]:
RandomForestClassifier(max_depth=4, max_features='log2', min_samples_leaf=3,
                       min_samples_split=5, n_estimators=300, random_state=123)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(max_depth=4, max_features='log2', min_samples_leaf=3,
                       min_samples_split=5, n_estimators=300, random_state=123)
In [40]:
y_train_pred = rf.predict(X_train)
y_pred = rf.predict(X_test)

6.1.3. Model assesment¶

In [41]:
class color:
    """Enables formatting printed strings."""
    RED = '\033[91m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'
    END = '\033[0m'

def fevaluate_model(y_test, y_pred):
    """
    Returns dictionary of main metrics used to evaluate a classification model,
    Comaprind predictions with actual values.
    Takes two arguments: real Y values and predicted Y values respectively.
    """
    # acc, roc_auc
    acc = accuracy_score(y_pred, y_test)
    err = 1 - acc
    auc = roc_auc_score(y_pred, y_test)

    # calculate TN, FN, FP, TP
    TN, FN, FP, TP = confusion_matrix(y_test, y_pred).ravel()
    # calculate specificity, sensitivity, positive predictive value, and negative predictive value 
    recall  = TP / (TP+FN) # czułość
    specificity  = TN / (TN+FP) # swoistość
    PPV = TP / (TP+FP) # precision
    NPV = TN / (TN+FN)
    
    # F1 score
    F1 = 2*(recall*PPV)/(recall+PPV)
    
    perf_metrics = {
        "Accuracy": acc,
        "Error rate": err,
        "AUC": auc,
        "Recall": recall,
        "Specificity": specificity,
        "PPV": PPV,
        "NPV": NPV,
        "F1": F1,
    }
    
    return perf_metrics
        

def evaluate_model(clf, X_test, y_test, X_train=None, y_train=None, verbose=True, threshold=0.5):
    """
    Returns dictionary of metrics assessing model performance on test set and,
    Optionally, on train set for a given classification threshold. 
    Takes three mandatory arguments: trained model, X test set and y test vector.
    Optional arguments are: X train set, y train vector, boolean parameter indicating,
    If results should be printed (default True), and threshold value (default 0.5).
    """
    def eval_print(metrics_dict, set_name="test"):
        """Prints values of model performance metrcics stored in dictionary."""
        print(color.BOLD + f"Evaluation measures on {set_name} set:" + color.END)
        for key, value in metrics_dict.items():
            print(f"{key}: {round(value, 3)}")
        print("-"*40)
    
    perf_metrics = {}
    
    if X_train is not None and y_train is not None:
        # check model performance on train set
        y_pred_train = (clf.predict_proba(X_train)[:,1] >= threshold).astype(bool)
        train_metrics = fevaluate_model(y_pred_train, y_train)
        
        if verbose:
            eval_print(train_metrics, set_name="train")
        
        perf_metrics["Train"] = train_metrics
        
    # check model performance on test set
    y_pred_test = (clf.predict_proba(X_test)[:,1] >= threshold).astype(bool)
    test_metrics = fevaluate_model(y_pred_test, y_test)
        
    if verbose:
        eval_print(test_metrics)
        
    perf_metrics["Test"] = test_metrics
        
    return perf_metrics
In [42]:
rf_perf_metrics = evaluate_model(rf, X_test, y_test, X_train, y_train)
Evaluation measures on train set:
Accuracy: 0.728
Error rate: 0.272
AUC: 0.546
Recall: 0.1
Specificity: 0.992
PPV: 0.848
NPV: 0.724
F1: 0.179
----------------------------------------
Evaluation measures on test set:
Accuracy: 0.7
Error rate: 0.3
AUC: 0.506
Recall: 0.032
Specificity: 0.98
PPV: 0.4
NPV: 0.707
F1: 0.06
----------------------------------------
In [43]:
def plot_conf_matrix(y_test, y_pred):
    """
    Plots confusion matrix based on predicted
    And real values of target variable y.
    """
    color = 'black'
    cf_matrix = confusion_matrix(y_test, y_pred)
    group_names = ['TN','FP','FN','TP']
    group_counts = [round(value, 3) for value in cf_matrix.flatten()/cf_matrix.sum()]
    labels = [f'{v1}\n\n{v2}' for v1, v2 in zip(group_names, group_counts)]
    labels = np.asarray(labels).reshape(2, 2)
    ax = sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Blues')
    ax.set_ylabel('True labels', color=color)
    ax.set_xlabel('Predicted labels', color=color)
    ax.set_title('Confusion Matrix', color=color)
    plt.grid(False)
    plt.show()
    
plot_conf_matrix(y_test, y_pred)

The Random Forest classifier was trained for the parameters selected during hyperparameter tuning. The accuracy of the model predictions on the test set was 0.7, indicating that 70% of the trips were correctly classified. However, the above printout and confusion matrix suggest that the model performs very poorly in identifying popular tours. The Random Forest assigns almost all observations to the class of unpopular tours. Since the dataset is unbalanced, this allows achieving a relatively high classification accuracy by simply predicting the negative class only. On the other hand, a very low sensitivity result and F1-score indicate that the model struggles to identify the positive class at all. Sensitivity indicates that only 3.2% of popular tours were correctly identified by the model.

6.1.4. SMOTE Algorithm¶

To improve the Random Forest model's ability to identify objects from the positive class, the SMOTE method was applied. The SMOTE algorithm is an oversampling technique that, instead of simply replicating observations from the minority class, generates new points as a combination of current observations and their nearest neighbors. After transforming the training set by generating new representatives of the positive class, hyperparameter tuning was performed again, and a new Random Forest model was built.

In [44]:
smote = SMOTE(random_state=seed)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
X_train_scaled_smote, _ = scale_train_test(X_train_smote, X_train_smote)
y_train_smote.value_counts()
Out[44]:
is_popular
1    1329
0    1329
Name: count, dtype: int64
In [46]:
# rf_smote = RandomForestClassifier(random_state=seed)

# rf_smote_cv = RandomizedSearchCV(estimator = rf_smote, 
#                            param_distributions = random_grid, 
#                            n_iter = 150, cv=5, verbose=2, 
#                            random_state=seed, n_jobs = -1)
# rf_smote_cv.fit(X_train_smote, y_train_smote)
# rf_smote_cv.best_params_
Fitting 5 folds for each of 150 candidates, totalling 750 fits
Out[46]:
{'n_estimators': 250,
 'min_samples_split': 30,
 'min_samples_leaf': 11,
 'max_features': 'sqrt',
 'max_depth': 4}
In [47]:
# with open('rf_smote_cv.pickle', 'wb') as handle:
#     pickle.dump(rf_smote_cv, handle, protocol=pickle.HIGHEST_PROTOCOL)
In [45]:
with open('rf_smote_cv.pickle', 'rb') as handle:
    rf_smote_cv = pickle.load(handle)

6.1.5. Random Forest after SMOTE oversampling¶

In [46]:
rf_smote = RandomForestClassifier(
    n_estimators=rf_smote_cv.best_params_['n_estimators'], 
    min_samples_split=rf_smote_cv.best_params_['min_samples_split'], 
    min_samples_leaf=rf_smote_cv.best_params_['min_samples_leaf'], 
    max_features=rf_smote_cv.best_params_['max_features'], 
    max_depth=rf_smote_cv.best_params_['max_depth'], random_state=seed
)
rf_smote.fit(X_train_smote, y_train_smote)
Out[46]:
RandomForestClassifier(max_depth=4, min_samples_leaf=11, min_samples_split=30,
                       n_estimators=250, random_state=123)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(max_depth=4, min_samples_leaf=11, min_samples_split=30,
                       n_estimators=250, random_state=123)
In [47]:
y_train_pred = rf_smote.predict(X_train_smote)
y_pred = rf_smote.predict(X_test)
In [48]:
rf_smote_perf_metrics = evaluate_model(rf_smote, X_test, y_test, X_train_smote, y_train_smote)
Evaluation measures on train set:
Accuracy: 0.732
Error rate: 0.268
AUC: 0.732
Recall: 0.796
Specificity: 0.667
PPV: 0.705
NPV: 0.766
F1: 0.748
----------------------------------------
Evaluation measures on test set:
Accuracy: 0.595
Error rate: 0.405
AUC: 0.6
Recall: 0.613
Specificity: 0.588
PPV: 0.384
NPV: 0.784
F1: 0.472
----------------------------------------
In [51]:
plot_conf_matrix(y_test, y_pred)

The application of the SMOTE algorithm improved the model's ability to detect positive samples. The sensitivity of the model on the test set increased from 3.2% to 61.3%, meaning that 61.3% of popular trips are correctly identified by the model. On the other hand, the overall classification accuracy of the model decreased by 10% compared to the naive model that classifies all observations into the negative class. The new model has a much lower specificity, meaning that the increased ability to recognize objects from the positive class negatively impacted the percentage of correctly identified trips that are not popular.

6.1.6. Variables importance¶

The charts below depict the importance of individual variables in the decision-making process of Random Forest models trained on the original dataset and the dataset after applying the SMOTE algorithm, respectively. In both cases, the most significant variables are "Stars," "Price," and the type of food - "breakfast." However, in case of the first model, the length of the trip, the number of destinations, and the departure date also play a significant role in the classification. In the second model, these variables are not that crucial.

In [49]:
feature_names = X_test.columns
fig, ax = plt.subplots(ncols=2, figsize=(20, 5))

importances = rf.feature_importances_
forest_importances = pd.Series(importances, index=X_test.columns)
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=seed)
rf_importances = pd.Series(result.importances_mean, index=feature_names)
forest_importances.plot.bar(yerr=result.importances_std, ax=ax[0], alpha=0.7, title="Random Forest")

importances = rf_smote.feature_importances_
forest_importances = pd.Series(importances, index=X_test.columns)
result = permutation_importance(rf_smote, X_test, y_test, n_repeats=10, random_state=seed)
rf_importances = pd.Series(result.importances_mean, index=feature_names)
forest_importances.plot.bar(yerr=result.importances_std, ax=ax[1], alpha=0.7, title="Random Forest + SMOTE")

plt.show()

6.1.7. Models interpretation¶

While the primary goal of building machine learning models is to predict outcomes based on input information as accurately as possible, the ability to understand how the classification occurred often proves equally important. To gain a better understanding of how the Random Forest models presented in this section make predictions, Partial Dependence Plots (PDP) and Ceteris Paribus (PCP) plots were used.

The Partial Dependence Plot (PDP) illustrates how the model's prediction changes for a specific instance by adjusting the value of a single variable, while keeping the other attributes constant. In a classification problem, the predicted probability of an object belonging to a certain class is affected.

To construct the PDP, an example trip was created, and the plot shows how the probability of it being classified as "popular" changes depending on the selected variables and the model used.

In [50]:
trip_1 = pd.DataFrame({
    "stars": [4],
    "duration": [5],
    "price": [4000],
    "multiple_destination": [0],
    "discount": [0],
    "greece": [1],
    "egypt": [0],
    "breakfast": [0],
    "breakfast_dinner": [0],
    "summer": [1],
})

Ceteris Paribus Plot¶

In [51]:
rf_exp = dx.Explainer(rf, X_train_smote, y_train_smote, label="RF")
rf_smote_exp = dx.Explainer(rf_smote, X_train_smote, y_train_smote, label="RF with SMOTE")

cp_trip_1 = rf_exp.predict_profile(trip_1)
cp_smote_trip_1 = rf_smote_exp.predict_profile(trip_1)
Preparation of a new explainer is initiated

  -> data              : 2658 rows 10 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 2658 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : RF
  -> predict function  : <function yhat_proba_default at 0x00000195212293A0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.0817, mean = 0.312, max = 0.633
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.531, mean = 0.188, max = 0.903
  -> model_info        : package sklearn

A new explainer has been created!
Preparation of a new explainer is initiated

  -> data              : 2658 rows 10 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 2658 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : RF with SMOTE
  -> predict function  : <function yhat_proba_default at 0x00000195212293A0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.092, mean = 0.499, max = 0.837
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.78, mean = 0.000841, max = 0.883
  -> model_info        : package sklearn

A new explainer has been created!
Calculating ceteris paribus: 100%|█████████| 10/10 [00:00<00:00, 33.43it/s]
Calculating ceteris paribus: 100%|█████████| 10/10 [00:00<00:00, 39.08it/s]

The first thing to note is that the sample trip was classified into different classes depending on the model used. The first model classified it as unpopular (which is not surprising since this model classifies almost all trips this way), while the model trained on the dataset after applying the SMOTE algorithm classified it as popular. The plots show how the predictions of the models would change if the price or the number of stars assigned to the trip were altered. It turns out that the first model would classify the sample trip as popular if it had 5 stars. An increase in price by just under 1,000 would lead to a decrease in the probability of assignment to the positive class for both models by about 10%. However, in either case, it would not affect the final decision of the model.

In [52]:
cp_trip_1.plot(cp_smote_trip_1, variables = ['stars', 'price'])

Partial Dependency Plot¶

The purpose of the Partial Dependence Plot (PDP) is to illustrate the marginal effect of changes in the value of a selected explanatory variable on the average prediction of the model. The method is analogous to the previously described Ceteris Paribus Plot but shows the model's behavior across the entire dataset, essentially representing the average of PDP plots calculated for all observations.

The presented PDP plots are very similar to the Ceteris Paribus plots shown above. This means that the effect of changes in the price and the number of stars on the model predictions for a sample instance is very close to the average effect determined for all observations. This is likely because the sample trip is representative, and its values do not deviate significantly from the average on the test set. Additionally, the model's decisions are likely influenced primarily by the variables presented in the plots.

In [53]:
pd_rf = rf_exp.model_profile(variables = ['stars', 'price'])
pd_rf_smote = rf_smote_exp.model_profile(variables = ['stars', 'price'])

# pd_rf = rf_exp.model_profile(groups = 'class', variables = ['age', 'fare'])
pd_rf.plot(pd_rf_smote)
Calculating ceteris paribus: 100%|███████████| 2/2 [00:00<00:00,  2.71it/s]
Calculating ceteris paribus: 100%|███████████| 2/2 [00:00<00:00,  3.26it/s]

6.2. Gradient Boosting Trees¶

In the next step of the project, a gradient boosting classifiers model was built. In this method, a sample is drawn to build each subsequent tree, with a higher chance for objects that were previously misclassified. This allows the model to learn from its mistakes.

Since the algorithm allows for hyperparameter tuning, the first step was to conduct hyperparameter tuning to find their optimal subset. The same values as for the random forest were considered, with two additional parameters, "nsamples" and "learning rate" added.

The model was trained on the dataset after applying SMOTE because a significant improvement in the model's ability to identify positive class observations was observed in the previous part after using SMOTE.

6.2.1. Hyperparameters tunning¶

The tuning procedure identified the following optimal hyperparameter set:

  • Number of trees in the forest: 200
  • Method for calculating the maximum number of variables to consider at each split: square root of the total number of variables
  • Maximum tree depth: 4 levels
  • Minimum number of samples in a leaf: 1
  • Minimum number of observations needed to split a node: 30
  • Learning rate: 0.3
  • Subsample ratio: 0.9
In [54]:
# liczba drzew
n_estimators = [int(x) for x in np.linspace(50, 300, 6, endpoint=True)]
# metoda obliczania max. liczby zmiennych do rozważenia przy każdym podziale drzewa
max_features = ['log2', 'sqrt']
# głębokość drzewa
max_depth = [int(x) for x in np.linspace(1, 4, 4, endpoint=True)]
# minimalna liczba próbek obecnych w liściu
min_samples_leaf = [int(x) for x in np.linspace(1, 11, 5, endpoint=True)]
# minimalna liczba obserwacji potrzebna do podziału węzła
min_samples_split = [int(x) for x in np.linspace(5, 55, 5, endpoint=True)]
# hiperparmetry wzmacniania gradientowego
subsample = [0.6, 0.7, 0.8, 0.9]
learning_rate = [0.005, 0.01, 0.05, 0.1, 0.2, 0.3]

random_grid = {
    'n_estimators': n_estimators,
    'max_features': max_features,
    'max_depth': max_depth,
    'min_samples_leaf': min_samples_leaf,
    'min_samples_split': min_samples_split,
    'learning_rate': learning_rate,
    'subsample': subsample
}
In [55]:
# gb_smote = GradientBoostingClassifier(random_state=seed)

# gb_smote_cv = RandomizedSearchCV(
#     estimator = gb_smote, 
#     param_distributions = random_grid, 
#     n_iter = 150, cv=5, verbose=2, 
#     random_state=seed, n_jobs = -1
# )

# gb_smote_cv.fit(X_train_smote, y_train_smote)
# gb_smote_cv.best_params_
Fitting 5 folds for each of 150 candidates, totalling 750 fits
Out[55]:
{'subsample': 0.9,
 'n_estimators': 200,
 'min_samples_split': 30,
 'min_samples_leaf': 1,
 'max_features': 'sqrt',
 'max_depth': 4,
 'learning_rate': 0.3}
In [56]:
# with open('gb_smote_cv.pickle', 'wb') as handle:
#     pickle.dump(gb_smote_cv, handle, protocol=pickle.HIGHEST_PROTOCOL)
In [45]:
with open('gb_smote_cv.pickle', 'rb') as handle:
    gb_smote_cv = pickle.load(handle)

6.2.2. Training the Gradient Boosting classifier¶

In [57]:
gb = GradientBoostingClassifier(
    n_estimators=gb_smote_cv.best_params_['n_estimators'], 
    min_samples_split=gb_smote_cv.best_params_['min_samples_split'], 
    min_samples_leaf=gb_smote_cv.best_params_['min_samples_leaf'], 
    max_features=gb_smote_cv.best_params_['max_features'], 
    max_depth=gb_smote_cv.best_params_['max_depth'], 
    learning_rate=gb_smote_cv.best_params_['learning_rate'],
    subsample=gb_smote_cv.best_params_['subsample'],
    random_state=seed
)
gb.fit(X_train_smote, y_train_smote)
Out[57]:
GradientBoostingClassifier(learning_rate=0.3, max_depth=4, max_features='sqrt',
                           min_samples_split=30, n_estimators=200,
                           random_state=123, subsample=0.9)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GradientBoostingClassifier(learning_rate=0.3, max_depth=4, max_features='sqrt',
                           min_samples_split=30, n_estimators=200,
                           random_state=123, subsample=0.9)

6.2.3. Model assesment¶

In [58]:
y_train_pred = gb.predict(X_train_smote)
y_pred = gb.predict(X_test)
In [59]:
gb_perf_metrics = evaluate_model(gb, X_test, y_test, X_train_smote, y_train_smote)
Evaluation measures on train set:
Accuracy: 0.905
Error rate: 0.095
AUC: 0.905
Recall: 0.878
Specificity: 0.932
PPV: 0.928
NPV: 0.884
F1: 0.903
----------------------------------------
Evaluation measures on test set:
Accuracy: 0.69
Error rate: 0.31
AUC: 0.607
Recall: 0.403
Specificity: 0.811
PPV: 0.472
NPV: 0.764
F1: 0.435
----------------------------------------
In [60]:
plot_conf_matrix(y_test, y_pred)

The above classifier achieved an accuracy of 0.69 on the test set, indicating that 69% of trips were correctly classified. This represents a significant improvement compared to the random forest model. The above printout and confusion matrix indicate that the model performs much better in predicting non-popular trips. This is evidenced by the low sensitivity and F1-score. A sensitivity of 40.3% indicates that only this percentage of popular trips was correctly identified by the model. Specificity, on the other hand, indicates that 81.1% of non-popular trips were correctly classified.

6.2.4. Model interpretation¶

For the analysis of the decision-making process of the model created in this section, break-down plot and Shapley value plot for additive attributions were used.

The Break-Down plot helps understand the additive impact of individual variable values on the model's final decision by decomposing the predicted probability into components representing the contribution of each attribute. The plot aims to capture the contribution of each feature by illustrating the change in the average prediction caused by setting the values of each explanatory variable. Initially, the BDP shows the average model prediction for the entire dataset. Each subsequent level of the plot shows the average change in prediction caused by setting the value of the next independent variable. The last row contains the sum of the average model prediction and the contributions of individual variables, i.e., the predicted probability value for the considered observation.

To construct both plots, the artificial trip offer constructed in the previous section was used.

Break-down plot for additive attributions¶

In [61]:
gb_exp = dx.Explainer(gb, X_train_smote, y_train_smote, label="Gradient Boosting Trees")
bd_trip = gb_exp.predict_parts(trip_1, type='break_down')
Preparation of a new explainer is initiated

  -> data              : 2658 rows 10 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 2658 values
  -> model_class       : sklearn.ensemble._gb.GradientBoostingClassifier (default)
  -> label             : Gradient Boosting Trees
  -> predict function  : <function yhat_proba_default at 0x00000195212293A0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 8.41e-05, mean = 0.501, max = 1.0
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.787, mean = -0.00134, max = 0.989
  -> model_info        : package sklearn

A new explainer has been created!

The chart below shows the contribution of all variables to the final prediction of the model. We can observe that the model considered the sample trip as popular with a probability of 56.5%. The most significant positive impact (increasing the likelihood of classifying the trip as popular) can be attributed to the trip's price of 4000 PLN, a star rating of 4, and a duration of 5 days. The probability is significantly reduced by the fact that the trip is to Greece and takes place in the summer season.

In [62]:
bd_trip.plot()

Shapley plot for average attributions¶

The Shapley plot for average attributions solves the issue present in the case of break-down plots, i.e., the dependency of the impact of individual variables on their order on the chart. The Shapley plot addresses the problem of dependence on the order of contributions by averaging the contributions of a variable for all (or many) possible orders.

In the chart below, we can see that the contribution of most variables is similar to that presented in the break-down plot. An important difference is the number of stars, which turned out to have a much less significant impact (and, moreover, negative) than indicated by the above chart. Instead, the type of meals and the fact that the price is not discounted turned out to be significant for classification decision.

In [63]:
shap_trip_1 = gb_exp.predict_parts(trip_1, type='shap')
shap_trip_1.plot()

7. Models comparison¶

In [64]:
def update_master_results(model_names, model_version, models_pm, master_results):

    for i, model_name in enumerate(model_names):
        test_pm = models_pm[i].get("Test")
        acc = test_pm.get("Accuracy")
        recall = test_pm.get("Recall")
        specificity = test_pm.get("Specificity")
        ppv = test_pm.get("PPV")
        f1 = test_pm.get("F1")
        new_row = [model_name, model_version, acc, recall, specificity, ppv, f1]
        master_results.loc[len(master_results)] = new_row
    
    return master_results    

# data frame do porównania wyników klasyfikacji
master_results = pd.DataFrame(columns=["Model", "Class imbalance approache", "Accuracy", "Recall", "Specificity", "Precision", "F1"])

model_names = ["Las losowy"]
model_version = "-"
models_pm = [rf_perf_metrics]
master_results = update_master_results(model_names, model_version, models_pm, master_results)

model_names = ["Las losowy", "Gradient Boosting Trees"]
model_version = "SMOTE"
models_pm = [rf_smote_perf_metrics, gb_perf_metrics]
master_results = update_master_results(model_names, model_version, models_pm, master_results)

The table below presents the classification results of all three models. Among all classifiers, Gradient Boosting Trees with the SMOTE algorithm achieves the best results. The algorithm correctly classifies 69% of observations, but accurately identifies only 40.3% of trips that are popular among the agency's clients. Slightly better in identifying objects belonging to the positive class is Random Forest with the SMOTE algorithm, which achieves an accuracy of 59.5%, but correctly identifies 61.3% of popular trips. On the other hand, compared to Gradient Boosting Trees, it performs worse in detecting unpopular trips. Random Forest without the SMOTE algorithm performs the worst.

In [65]:
master_results
Out[65]:
Model Class imbalance approache Accuracy Recall Specificity Precision F1
0 Las losowy - 0.700 0.032 0.980 0.400 0.060
1 Las losowy SMOTE 0.595 0.613 0.588 0.384 0.472
2 Gradient Boosting Trees SMOTE 0.690 0.403 0.811 0.472 0.435
In [66]:
def spider_plot(metrics_lists, labels, plot_title):
    categories = ['Accuracy', 'Recall', 'Specificity', 'Precision', 'AUC']
    N = len(categories)
    angles = [n / float(N) * 2 * np.pi for n in range(N)]
    angles += angles[:1]

    plt.figure(figsize=(10, 10))
    ax = plt.subplot(111, polar=True)
    ax.set_theta_offset(np.pi / 2)
    ax.set_theta_direction(-1)
    plt.xticks(angles[:-1], categories, size=15)
    ax.set_rlabel_position(0.3)
    plt.yticks([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], color="grey", size=12)
    plt.ylim(0, 1)
    
    colors = ["blue", "green", "red", "black"]
    for i in range(len(metrics_lists)):
        values = metrics_lists[i]
        values += values[:1]
        ax.plot(angles, values, linewidth=1.1, linestyle='solid', label=labels[i], color=colors[i])
        
    plt.title(plot_title, size=20, y=1.05)
    plt.legend(prop={'size': 15}, loc='lower left')
    plt.show()
    
metrics_lists = [list(r)[2:] for r in master_results.values.tolist()]
labels = ["Las losowy", "Las losowy + SMOTE", "Gradient Boosting Trees + SMPTE"]
plot_title = "Porównanie modeli"
spider_plot(metrics_lists, labels, plot_title)

Summary¶

The aim of this project was to build machine learning classifiers capable of predicting with satisfactory accuracy the popularity of Rainbow Tour's summer 2023 offerings. As a result, three models were built:

  • Random Forest
  • Random Forest combined with SMOTE oversampling technique
  • Gradient Boosting combined with SMOTE oversampling technique

The model that achieved the best classification quality was Gradient Boosting classifier in combination with the SMOTE algorithm, with an accuracy of 69%, sensitivity of 40%, and specificity of 81%. It was observed that without oversampling, the models classified a significant majority of trips as unpopular. Thanks to the application of SMOTE, the models were able to better identify observations belonging to the positive class. The Gradient Boosting model achieved a worse classification measure only for sensitivity compared to the Random Forest. The Random Forest model with the SMOTE algorithm has about 20% better ability to correctly classify popular trips among customers.

In addition to achieving high classification quality, the project aimed at interpreting the built models to understand which parameters of trips influence their popularity among agency customers and how. The project led to the following conclusions:

  1. The most important predictor of trip popularity is the number of stars assigned to the trip by the agency. Popularity increases with the number of stars, but decreases when the number is 5. This relationship may be bidirectional. Customers may prefer trips with a high rating from the agency. On the other hand, Rainbow may better rate trips that are popular among customers.
  2. The price of the trip and whether it has been discounted are significant factors in assessing the popularity of the trip. Typically, cheaper trips are more eagerly purchased. On the other hand, a price reduction negatively affects the popularity of the trip. The reason may be that the agency mainly lowers the prices of less popular trips to encourage their purchase.
  3. Trips to Greece are less popular than to other countries.
  4. Trips organized in the autumn-winter period are more popular than those in the spring-summer period.
  5. Breakfast is a common type of catering on popular trips.

Moreover, the following action could be taken in order to improve the model's performance:

  1. More explanatory variables can be created from the existing features, such as number of trip destinations or percentage discount, that might turn out valuable for the model making classification.
  2. Different models should be tested besides Random Forest and Gradient Boosting to obtain a relaible benchmark of different methods' results.
  3. Different oversampling and undersampling techniques should be tested on the dataset.
  4. More extended and sophisticated techniques of model interpretation could be use to better understand patterns driving the travel industry.